Scrape the Data

Generate a Structure for Scraping

The first step in this process is generating a URL for each page of the forum. Each page holds 10 posts, and as of the start of this project (5/1/2019) the forum spans 502 pages. The first page’s URL is the website’s base URL followed by the forum ID number. All subsequent pages append “-#” (the page number) before the final forward slash. I use a simple for loop to generate a vector of all the URLs in the forum. Next, I make an empty dataframe to hold the scraped information. I gather the following:

  1. Username
  2. Date
  3. Time
  4. Text
#generate a url for each page of the ideology and philosophy forum 
ideo_philo_urls <- c("https://www.stormfront.org/forum/t451603/")

#append a url for each of pages 2 through 502 
for(i in 2:502){
  ideo_philo_urls <- c(ideo_philo_urls, paste0("https://www.stormfront.org/forum/t451603-", 
                                               i,
                                               "/"))
}

ideology_forum <- data.frame(user = character(),
                             date = character(),
                             time = character(),
                             text = character())

Loop through the URLs to Scrape the Posts

Now that I have all of the forum-page URLs in a vector and an empty dataframe to save the results in, I execute the following for loop to scrape all of the data from the forum and put it into a dataframe with the variables mentioned above. I scrape the data in three parts: the text itself, the date and time together, and then the usernames. The corresponding parts of the webpage scraping are labeled in the code below. I use the stringr package to extract the data that I want.

As of compilation, the last forum page does not have a full 10 comments, but the temporary object holding the extracted comments still has a length of 10 while the date, time, and user vectors have fewer than 10. To keep this mismatch from throwing an error and halting the knit of the document, the loop adds the new posts to the full dataframe on every iteration except the last one. I add the last set of posts to the dataframe separately.

for(i in 1:length(ideo_philo_urls)){
  page <- read_html(url(ideo_philo_urls[i]))
  
#read the text from the posts 
  page_text_prelim <- page %>% 
    html_nodes("#posts .alt1") %>% 
    html_text()
    
#extract the text from the posts. Every other index in this vector is the post, with the remaining indices being missing. 
  page_text <- page_text_prelim[seq(1, 20, 2)]
  
  
  page_date_time <- page %>% 
    html_nodes("#posts .thead:nth-child(1)") %>% 
    html_text() 
  
  page_date_time_prelim <- page_date_time %>% 
                           data.frame() %>% 
                           janitor::clean_names() %>% 
                           mutate(date = stringr::str_extract(x, 
                                                              "\\d{2}\\-\\d{2}\\-\\d{4}"),
                                  time = stringr::str_extract(x, 
                                                              "\\d{2}\\:\\d{2}\\s[A-Z]{2}")) %>% 
                           filter(!is.na(date)) %>%  
                           select(date,
                                  time)
  
  page_date <- as.vector(page_date_time_prelim$date)
  page_time <- as.vector(page_date_time_prelim$time)
  
  page_user_prelim <- page %>% 
                      html_nodes("#posts .alt2") %>% 
                      html_text() %>% 
                      data.frame() %>% 
                      janitor::clean_names() %>% 
                      mutate(text = as.character(x),
                             user_time_detect = as.numeric(stringr::str_detect(text,
                                                             "Posts:")),
                             user = stringr::str_extract(text,
                                       "([A-z0-9]+.)+")) %>% 
                      filter(user_time_detect == 1) %>% 
                      select(user)
  
  page_user <- as.vector(page_user_prelim$user)
  
#as of 5/6/2019 the final iteration errors because the last page only has 7 posts, so page_date, page_time, and page_user have fewer than 10 elements while page_text still has 10 (padded with NAs). The following if condition prevents the last iteration from erroring. 
  if(i < length(ideo_philo_urls)){
  
  page_df <- data.frame(user = as.character(page_user),
                        date = as.character(page_date),
                        time = as.character(page_time), 
                        text = as.character(page_text))

  
  ideology_forum <- rbind(ideology_forum, page_df)
  }
}

#Handle the final page that the loop skipped: drop the NA text entries before binding 

page_text <- as.vector(na.omit(page_text))
page_df <- data.frame(user = as.character(page_user),
                      date = as.character(page_date),
                      time = as.character(page_time), 
                      text = as.character(page_text))


ideology_forum <- rbind(ideology_forum, page_df)

Clean the Scraped Data

The code below cleans the data captured above. There are several problems with irrelevant text in the posts. The first problem is that each post begins with the three word-like tokens “Re: National Socialism” because that is the name of the forum thread. These three words are not relevant to discerning what is actually being discussed and are therefore removed. The second problem is that many of the posters quote each other and outside materials in their back and forth. In this project I am only interested in the novel components of each post, so I remove all quoted text from each post. The third problem is line breaks and other control characters. Fourth, I remove all punctuation from the text for more succinct analysis.

The column “text_nore” is the post itself without the initial indicator that it is a response to the forum. Removing the text in this case is straightforward because I only care about one exact phrase that does not appear elsewhere.
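As a minimal sketch of this step (using base R’s `sub()` rather than the stringr call in the actual cleaning chunk, and a made-up post string):

```r
# a toy post beginning with the thread-reply indicator
post <- "\n\n\nRe: National Socialism\n\n\nThis is the actual post text."

# remove the single fixed phrase; the leftover line breaks are
# handled by the control-character step later on
text_nore <- sub("Re: National Socialism", "", post, fixed = TRUE)
```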

The column “text_noquote” is the text of the post remaining from text_nore, minus the text in quotes. This was a rather challenging piece of text to address. There is an example post in its raw form below that shows just how tricky this part was to solve. The selected example has three quotes: the first names the user quoted, and the following two do not. The quotes have an inconsistent form, which makes it difficult to capture every possible quote with one regex pattern. However, enough is consistent to make it work. First, all quotes start with the word “Quote:”, so I can easily identify the start of a quote. Second, all quotes end with two line breaks “\n\n”. In between those markers there are several words, control characters, and punctuation. To capture these patterns I use the greediest (and laziest) approach that works: match 0 or more of a pattern that may or may not exist within a quote until the two line breaks are matched at the end of the quote. This ultimately works on all quote types, and the final regex form can be seen below.

## [1] "\n\n\nRe: National Socialism\n\n\n\nQuote:\n\n\nOriginally Posted by Garak\n\n\nEver heard of copy and paste? I'll bet you could find it before I could. Give me a page number for quicker reference perhaps.\n\nSurely, in the same time it took you to post the question of what the difference between Socialism and National Socialism is, plus these other bickering posts, you could of \"copy and pasted\" all you wanted.\nQuote:\n\nHandouts? What the hell are you talking about. If asking you to clarify your position is a handout in your mind your little movement won't go anywhere.\n\nYou come into my thread with an interest in national socialism, but you don't bother to read the thread at all. Instead, you expect everyone else to compensate your laziness by digging through and finding it themselves for you, when we've already done our fair share of explaining it ourselves.Read the thread.\nI'm not even a National Socialist, and it annoys me that you would threaten to \"not support NS\" if we don't beckon to your will. NSers are our brothers, all the same. Unity makes us powerful, dissent breaks us.\nQuote:\n\nBTW, where are you in ND?\n\nWith how you've been acting, I'm not sure if I want to tell you.I don't want this thread to turn toward further argument unrelated to the topic.\n\n\n"
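The lazy-quantifier idea can be illustrated on a simplified quote whose body contains no internal double line breaks. The real pattern in the cleaning chunk has to handle messier cases like the example above, so this is only a sketch on a made-up post:

```r
post <- paste0("Re: Topic\n",
               "Quote:\nOriginally Posted by SomeUser\n",
               "the quoted sentence.\n\n",
               "The actual reply text.")

# lazily consume anything (including line breaks) from "Quote:"
# up to the first double line break that ends the quote
text_noquote <- gsub("Quote:(.|\n)*?\n\n", "", post, perl = TRUE)
```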

The problem of removing quotes posed another ‘unsolvable’ problem: some posts caused the mutate line creating “text_noquote” to hang and never finish, no matter how long it ran. I isolated the 8 posts causing this problem via a manual binary search. The only solution seems to be removing these posts, which is unfortunate. However, I have just over 5000 comments, so it is not THAT big of a deal.

The final two problems are trivial to solve. I remove all control characters with the “\c” regex pattern, storing the result in the column “text_nobreak”, and I remove all punctuation with the “[[:punct:]]” regex pattern. The final column created in the code chunk below, text_nopunct, therefore has the cleanest form.
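In base R terms (the cleaning chunk itself uses stringr), the two removals look roughly like this, with the standard “[[:cntrl:]]” class standing in for the control-character pattern:

```r
post <- "This post\nhas line breaks,\tand punctuation!"

# replace control characters (line breaks, tabs, etc.) with spaces
text_nobreak <- gsub("[[:cntrl:]]", " ", post)

# strip punctuation for the final, cleanest column
text_nopunct <- gsub("[[:punct:]]", "", text_nobreak)
```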

Summary Visuals

This section provides some visualizations of who is posting, what they are saying, and how much they write per post. The first figure below shows some simple summary information about the top 50 posters. As is obvious from the graph, kayden is by far the most frequent poster, followed by Kaiserreich and John Hawkwood. There is a large discrepancy both among the top three themselves and between the top three and the rest of the posters. Each of the top three is separated by about 100 posts, and only 11 users have posted more than 100 times. Another interesting note from this figure is that the top posters are certainly not examples of ‘post frequently, but post little.’ The bars are shaded with darker shades indicating more words per post (the numbers labeling the bars are length/10), and every 100+ poster averages at least 60 words per post, indicating that they are contributing in some meaningful way to the debate in the forum.

## Warning: Column `user` joining factors with different levels, coercing to
## character vector

The following figure shows the relationship, or lack thereof, between the number of posts and the length of the posts. Using the full dataset, the OLS fit shows basically no relationship between frequency and length of posts, with a coefficient of only 0.008. It seems that for those underneath the threshold of 100 posts the relationship is much stronger and positive, but after subsetting the same plot (not shown here) I find that excluding the top posters does not make the relationship stronger.
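The OLS check can be sketched as follows, assuming a hypothetical per-user summary with columns `n_posts` and `mean_length` (these names and values are made up for illustration; the actual figure code is not shown in this section):

```r
# toy per-user summary: post counts and average post lengths
user_summary <- data.frame(n_posts     = c(10, 50, 120, 300, 640),
                           mean_length = c(80, 95, 60, 110, 70))

# regress average post length on post count; a slope near zero
# indicates no real relationship between frequency and length
fit <- lm(mean_length ~ n_posts, data = user_summary)
coef(fit)["n_posts"]
```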

## Warning: Column `user` joining character vector and factor, coercing into
## character vector

user_month %>% 
  filter(ym < "2012-01") %>% 
  ggplot(aes(x = as.Date(paste0(ym, "-01")),
             y = n_month,
             group = 11)) +
  geom_line(aes(size = mean_length10)) +
  scale_x_date(date_breaks = "3 months",
               date_labels = "%Y-%B") +
  facet_grid(user ~ .) +
  labs(title = "Top Posters pre-2012 Activity",
       x = "Months",
       y = "Posts per Month",
       size = "Average Post Length") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90,
                                   size = 50),
        axis.text.y = element_text(size = 50),
        axis.title = element_text(size = 50),
        title = element_text(size = 80), 
        plot.title = element_text(hjust = .5),
        legend.text = element_text(size = 40),
        legend.key.size = unit(7, 'lines'),
        legend.title.align = .5,
        legend.background = element_rect(fill = NA),
        strip.text = element_text(size = 55,
                                    angle = 90))

data("stop_words")
library(SnowballC)

unigram <- cleaning %>% 
         mutate(text_clean = str_replace_all(text_nopunct, 
                                             "[Nn]ational\\s[Ss]ocialism[A-z]*", 
                                             "ns") %>% 
                             str_replace_all("[Nn]ational\\s[Ss]ocialist[A-z]*",
                                             "ns") %>% 
                             str_replace_all("[Aa]dolf(\\s)*[Hh]itler[A-z]*",
                                             "hitler") %>% 
                             str_replace_all(".*[Hh]itler",
                                             "hitler") %>% 
                             str_replace_all("[Hh]itler.*",
                                             "hitler") %>%  
                             str_replace_all("[A-z]*[Mm]ein(\\s)*[Kk]ampf[A-z]*",
                                             "meinkampf")) %>% 
         unnest_tokens(word, 
                       text_clean) %>% 
         select(user, 
                time,
                date,
                word,
                id) %>%
         anti_join(stop_words)
## Joining, by = "word"
unigram <- unigram %>% 
         mutate(word_stem = wordStem(word),
                word_stem = ifelse(word == 'hitler' | word == 'ns' | word == 'meinkampf',
                                   as.character(word),
                                   word_stem))
unigram_top50 <- unigram %>% 
                   count(word, sort = T) %>% 
                   mutate(word = reorder(word, n)) %>% 
                   slice(1:50) %>% 
                   mutate(word_stem = wordStem(word),
                          word_stem = ifelse(word == 'hitler' | word == 'ns' | word == 'meinkampf',
                                             word,
                                             word_stem))

bigram <- cleaning %>% 
         mutate(text_clean = str_replace_all(text_nopunct, 
                                             "[Nn]ational\\s[Ss]ocialism[A-z]*", 
                                             "ns") %>% 
                             str_replace_all("[Nn]ational\\s[Ss]ocialist[A-z]*",
                                             "ns") %>% 
                             str_replace_all("[Aa]dolf(\\s)*[Hh]itler[A-z]*",
                                             "hitler") %>% 
                             str_replace_all(".*[Hh]itler",
                                             "hitler") %>% 
                             str_replace_all("[Hh]itler.*",
                                             "hitler") %>%  
                             str_replace_all("[A-z]*[Mm]ein(\\s)*[Kk]ampf[A-z]*",
                                             "meinkampf") %>% 
                             str_replace_all("[Nn]ationalis[tm]",
                                             "national")) %>% 
         unnest_tokens(bigram, 
                       text_clean,
                       token = "ngrams", 
                       n = 2) %>% 
         select(user, 
                time,
                date,
                bigram,
                id)

bigram_split <- bigram %>% 
                separate(bigram, c("word1", 
                                   "word2"), 
                         sep = " ",
                         remove = F) %>% 
                filter(!word1 %in% stop_words$word) %>% 
                filter(!word2 %in% stop_words$word) %>% 
                filter(!is.na(bigram))

bigram_sorted <- bigram_split %>% 
          count(bigram, sort = T) %>% 
          mutate(bigram = reorder(bigram, n))

bigram_sorted %>% 
  slice(1:50) %>% 
  ggplot(aes(y = n,
             x = bigram)) +
  geom_col() +
  coord_flip() + 
  theme_bw()

knitr::kable(unigram_top50 %>% select(-n))
|word       |word_stem  |
|:----------|:----------|
|ns         |ns         |
|people     |peopl      |
|white      |white      |
|hitler     |hitler     |
|race       |race       |
|dont       |dont       |
|government |govern     |
|thread     |thread     |
|time       |time       |
|world      |world      |
|system     |system     |
|racial     |racial     |
|nation     |nation     |
|socialism  |social     |
|german     |german     |
|political  |polit      |
|germany    |germani    |
|economic   |econom     |
|post       |post       |
|im         |im         |
|american   |american   |
|read       |read       |
|war        |war        |
|capitalism |capit      |
|power      |power      |
|jews       |jew        |
|america    |america    |
|movement   |movement   |
|agree      |agre       |
|life       |life       |
|money      |monei      |
|true       |true       |
|whites     |white      |
|aryan      |aryan      |
|national   |nation     |
|society    |societi    |
|party      |parti      |
|idea       |idea       |
|social     |social     |
|jewish     |jewish     |
|means      |mean       |
|understand |understand |
|free       |free       |
|history    |histori    |
|jew        |jew        |
|country    |countri    |
|modern     |modern     |
|future     |futur      |
|natural    |natur      |
|religion   |religion   |